
Publication Details

Text Language English
Authors Rina Buoy, Masakazu Iwamura, Sovila Srun, Koichi Kise
Title ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition
Journal Journal of Imaging
Vol. 9
No. 12
Number of Pages 17
Publisher MDPI
Reviewed or not Reviewed
Month & Year December 2023
Abstract Attention-based encoder-decoder scene text recognition (STR) architectures have proven effective at recognizing text in the real world, thanks to their ability to learn an internal language model. Nevertheless, the cross-attention operation used to align visual and linguistic features during decoding is computationally expensive, especially in low-resource environments. To address this bottleneck, we propose a cross-attention-free STR framework that still learns a language model. The proposed framework, ViTSTR-Transducer, draws inspiration from ViTSTR, a vision transformer (ViT)-based method designed for STR, and from the recurrent neural network transducer (RNN-T), originally introduced for speech recognition. The experimental results show that our ViTSTR-Transducer models outperform the baseline attention-based models in terms of the required decoding floating-point operations (FLOPs) and latency while achieving a comparable level of recognition accuracy. Compared with the baseline context-free ViTSTR models, our proposed models achieve superior recognition accuracy. Furthermore, compared with recent state-of-the-art (SOTA) methods, our proposed models deliver competitive results.
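
To make the decoding idea in the abstract concrete, below is a minimal PyTorch sketch of an RNN-T-style joint network that combines visual encoder features with the states of an internal language model (prediction network) without any cross-attention. This is not the authors' implementation; the class name, dimensions, vocabulary size, and the additive joiner design are illustrative assumptions based on the standard RNN-T formulation the paper builds on.

```python
import torch
import torch.nn as nn

class TransducerJoint(nn.Module):
    """Cross-attention-free joiner in the RNN-T style (illustrative sketch).

    Visual frame features (e.g., ViT patch tokens) and internal-LM states
    are projected to a shared space and combined additively, so no
    query-key-value cross-attention is needed during decoding.
    All sizes below are assumptions, not the paper's configuration.
    """
    def __init__(self, enc_dim=192, pred_dim=192, joint_dim=256, vocab=97):
        super().__init__()
        self.embed = nn.Embedding(vocab, pred_dim)        # previous characters
        self.pred = nn.LSTM(pred_dim, pred_dim, batch_first=True)  # internal LM
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)

    def forward(self, enc_feats, prev_tokens):
        # enc_feats:   (B, T, enc_dim) visual features from the ViT encoder
        # prev_tokens: (B, U) previously emitted character ids
        pred_states, _ = self.pred(self.embed(prev_tokens))   # (B, U, pred_dim)
        e = self.enc_proj(enc_feats).unsqueeze(2)             # (B, T, 1, joint_dim)
        p = self.pred_proj(pred_states).unsqueeze(1)          # (B, 1, U, joint_dim)
        return self.out(torch.tanh(e + p))                    # (B, T, U, vocab)

# Toy usage: join 196 patch features with 10 previously decoded characters.
joint = TransducerJoint()
logits = joint(torch.randn(2, 196, 192), torch.randint(0, 97, (2, 10)))
print(logits.shape)  # torch.Size([2, 196, 10, 97])
```

Because the joiner only adds two projected tensors, the per-step decoding cost avoids the quadratic query-key interactions of cross-attention, which is the efficiency argument the abstract makes.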
DOI 10.3390/jimaging9120276